Most of the data we collect at sea is spatially and temporally correlated, such as the International Bottom Trawl Survey (IBTS) or Beam Trawl survey data (BTS) survey data. Often we want to infer temporal or spatial trends in these data.
The statistical package INLA has the advantage over other software in R that it can combine spatial, temporal, zero-inflated, random effects etc models all in one. It therefore is very powerful and may become your one-stop-to-go package.
This document shows some examples of analyzing survey data using INLA. The first model estimates the number of individuals per haul, taking account of the haul location (assuming spatial autocorrelation) and year (as a 'fixed effect'). Differences in counts as a result of the two surveys are accounted for by also having th survey as a fied effect. Subsequent models take lengths into account in determining the counts. In the final part, the counts are expressed per unit area, so population estimates can be made using a 'swept area' approach.
First, we need to set the path where the data is located and to load the relevant libraries. The inla package can be installed using install.packages(“INLA”, repos=“https://inla.r-inla-download.org/R/stable”). We need the inla package, but also a number of packages for spatial analyses ('rgeos' and 'rgdal') and for plotting the data and the results ('lattice', 'latticeExtra', 'grid','gridExtra','mapdata', and 'maptools').
path <- "~/WMR-INLA/data"
library(INLA); library(fields);
library(lattice); library(latticeExtra); library(grid); library(gridExtra);
library(rgdal); library(rgeos);
library(mapdata); library(maptools)
We will use the BTS and IBTS data set. In this case we downloaded the data for starry ray. These dataset contains the CPUE per length per haul can be dowloaded from datras.ices.dk.
First we read in the IBTS dataset. From these data, the hauls in the North Sea are selected by keeping only those that are in (roundfish)areas 1-7.
cpue_IBTS <- read.csv(file.path(path,"CPUE per length per haul_2017-07-12 11_18_36.csv"), stringsAsFactors = F)
cpue_IBTS <- cpue_IBTS[cpue_IBTS$Area <= 7,]
The lengths are stored in mm and the CPUE is stored as number per hour. “Zero hauls” are included (hauls where no individuals are c$
cpue_IBTS$NoPerHaul <- round(cpue_IBTS$CPUE_number_per_hour*(cpue_IBTS$HaulDur/60))
cpue_IBTS <- cpue_IBTS[!(names(cpue_IBTS) %in% c("DateofCalculation", "Area", "SubArea", "DayNight","Species","Sex","DateTime","CPUE_number_per_hour"))]
cpue_IBTS$uniqueHaul <- paste(cpue_IBTS$Year, cpue_IBTS$Survey, cpue_IBTS$Quarter, cpue_IBTS$Ship, cpue_IBTS$HaulNo, cpue_IBTS$ShootLat, cpue_IBTS$ShootLon, sep="_")
We have a similar but slightly different data set for BTS. However, this dataset has many more columns that we do not really need for these examples, and that we thus remove before combining the IBTS and BTS data sets. Here, the “zero hauls” have lengthclass NA and the number per haul is NA. These observations are then set to zero, so that we have the same structure as in the IBTS data set.
cpue_BTS <- read.csv(file.path(path,"CPUE per length per Hour and Swept Area_2017-07-12 11_29_36.csv"), stringsAsFactors = F)
cpue_BTS <- cpue_BTS[!(names(cpue_BTS) %in% c("Country","DoorType","HaulLat","HaulLong","StatRec","DataType","Rigging","Tickler","Warplngt","TowDir","WindDir","WindSpeed",
"SwellDir","SwellHeight","StNo","Month","Day","TimeShot","Stratum","ICESArea","DateofCalculation", "Area", "SubArea", "DayNight",
"Species", "Sex","SpecVal","SubFactor","GearExp","SweepLngt","Netopening","BycSpecRecCode","StdSpecRecCode","HaulVal","CPUE_number_per_hour",
"CPUE_number_per_km2","SweptArea_km2","DistanceDerived", "HLNoAtLngt"))]
cpue_BTS[is.na(cpue_BTS$LngtClass),]$NoPerHaul <- 0
cpue_BTS[is.na(cpue_BTS$LngtClass),]$LngtClass <- 0
cpue_BTS$surface <- cpue_BTS$Distance * cpue_BTS$BeamWidth
cpue_BTS$uniqueHaul <- paste(cpue_BTS$Year, cpue_BTS$Survey, cpue_BTS$Quarter, cpue_BTS$Ship, cpue_BTS$HaulNo, cpue_BTS$ShootLat, cpue_BTS$ShootLon, sep="_")
The BTS and IBTS datasets need to be combined the two sets into one, using rbind() to combine them by rows. The datasets can only e combined if they have the same columns. The columns that are shared by the two data sets are found using intersect().
cols <- intersect(colnames(cpue_BTS), colnames(cpue_IBTS))
cpue <- rbind(cpue_BTS[,cols],cpue_IBTS[,cols])
Now that the IBTS and BTS data sets are combined we want to make a set where we have counts for all hauls and all lenghts. This means first making an dentifier for each unique haul (based on the information we have for all the hauls). This identifier is used to make a “trawllist” where all the information for the hauls is stored together with its unique identifier.
Once the trawllist is make we use epand.grid() to make a combination of all hauls and lenght classes. This set is merged with our original data set.
trawllist <- cpue[!duplicated(cpue$uniqueHaul),!names( cpue) %in% c("Species","AphiaID","NoPerHaul","Sex", "LngtClass")]
#expand grid
haulsLengths <- expand.grid(uniqueHaul=unique(cpue$uniqueHaul),LngtClass=unique(cpue$LngtClass), stringsAsFactors = F)
full_cpue <- merge(haulsLengths,cpue[names(cpue) %in% c("uniqueHaul", "LngtClass","NoPerHaul")], all.x=T)
rm(haulsLengths)
After we merged all possible combinations with the data we now have NAs for those lengts and hauls where the catch is zero, and so we set those to zero. This data is subsequently merged to the trawllist so that we now have all information together.
full_cpue[is.na(full_cpue$NoPerHaul),]$NoPerHaul <- 0
full_cpue <- merge(full_cpue,trawllist, all.x=T)
The records that have lenghts equal to zero (that indicated zero hauls in our original set ) now need to be removed because we have all the information we need (these hauls now have zero catch for the full length range observed in the survey).
In addition, there are some observations that are highly unlikely: For instance there is a single observation of an individual of 100 cm (in 1977). This is suspicious, and we remove it from the data. Note that when the length information is included in the analyses a further selection is made for the lengths.
#now remove zero lengths
full_cpue <- full_cpue[full_cpue$LngtClass> 0,]
full_cpue <- full_cpue[full_cpue$LngtClass< 990,]
head(full_cpue)
## uniqueHaul LngtClass NoPerHaul Survey Quarter
## 2 1966_NS-IBTS_1_AND_1_54.2333_7.45 10 0 NS-IBTS 1
## 3 1966_NS-IBTS_1_AND_1_54.2333_7.45 20 0 NS-IBTS 1
## 4 1966_NS-IBTS_1_AND_1_54.2333_7.45 30 0 NS-IBTS 1
## 5 1966_NS-IBTS_1_AND_1_54.2333_7.45 50 0 NS-IBTS 1
## 6 1966_NS-IBTS_1_AND_1_54.2333_7.45 60 0 NS-IBTS 1
## 7 1966_NS-IBTS_1_AND_1_54.2333_7.45 70 0 NS-IBTS 1
## Ship Gear HaulNo Year HaulDur ShootLat ShootLong Depth
## 2 AND H18 1 1966 60 54.2333 7.45 40
## 3 AND H18 1 1966 60 54.2333 7.45 40
## 4 AND H18 1 1966 60 54.2333 7.45 40
## 5 AND H18 1 1966 60 54.2333 7.45 40
## 6 AND H18 1 1966 60 54.2333 7.45 40
## 7 AND H18 1 1966 60 54.2333 7.45 40
For our spatial correlation we will need an isomorphic coordinate system. Therefore we transform the latitudes and longitudes to UTM coordinates. these UTM coordinates are given in meters, so we divide them by 1000 to get kilometers.
UTM <- project(cbind(full_cpue$ShootLong, full_cpue$ShootLat), "+proj=utm +zone=31U ellps=WGS84")
full_cpue$Xkm <- UTM[,1]/1000
full_cpue$Ykm <- UTM[,2]/1000
The INLA code does not like special characters in some of the variable names, like the hyphen in “NS-IBTS”. Therefore we rename the survey to NSIBTS.
#recode survey names so that there are no special characters
full_cpue[full_cpue$Survey=="NS-IBTS",]$Survey <- "NSIBTS"
We still have a lot of data (2077880), Because these are just here as example, we will subset the data, and start our analysis from a fiven year. That year is stored in startyr.
# make selection of fullset.
startyr <- 2013
cpue_subset <- full_cpue[full_cpue$Year >= startyr,]
In this example we thus start from 2013. The earlier we start, the more information we will obtain from the analysis on the annual changes in the count data. However, including more data also means we require more memory to store the model, and wait longer for the model to run. If too much data is used in the inla model, so that more computer memory is required than you have available, the model will crash. Now we have 267315 rows left in cpue_subset. We start with a very simple analysis where we do not take length into account, and only analyse numbers per haul. Hence we aggregate the counts per haul.
cpue_subset <- aggregate(NoPerHaul ~ uniqueHaul + Survey + Quarter + Ship + Gear + HaulNo + Year + HaulDur + ShootLong + ShootLat + Xkm + Ykm + Depth, data= cpue_subset,FUN="sum")
After the aggregation (now that the length information is removed) we have 3514 rows in the data set. To give us a quick impression of the data we plot it in a map. The two surveys have different colors, red being the IBTS hauls and black being the BTS hauls. The size of the circles indicate the counts in each haul
plot(cpue_subset$ShootLong,cpue_subset$ShootLat,
cex=(cpue_subset$NoPerHaul)/10+0.1,
col=as.numeric(as.factor(cpue_subset$Survey)),
xlab="Longitude", ylab="Lattitude")
map("world",add=T)
Clearly the IBTS hauls cover a larger part of the North Sea, but the large catches for both are located in the same areas.
We also plot a quick histogram of the counts, to give us an impression of the statistical distribution of the data.
hist(cpue_subset$NoPerHaul, 200, xlab= "Number per haul")
The count data is clearly quite skewed, with lots of zeros and small values, and very few large counts. Because of this distribution and because these are counts, a Poisson and a negative binomial distribution will be used in the inla model lateron.
The UTM coordinates of the data set are combined into a Loc (location) dataset. That data set will be used later to generate spatial meshes for the data.
Loc <- cbind(cpue_subset$Xkm , cpue_subset$Ykm )
Next we need a mesh for the spatial data. Because we do not want our spatial correlations to pass landmasses (e.g. Denmark) we first make a nonconvex hull of the data points using inla.nonconvex.hull(). This convex hull is used as a boundary for making a 2d mesh. The generation of the actual mesh is done using inla.mesh.2d(). That function takes several arguments, incuding “cutoff” and “max.edge”. These arguments specify how fine the final mesh will be. Finer meshes will be able to capture smaller scale spatial correlations, but require more comuting time in the inla model.
ConvHull <- inla.nonconvex.hull(Loc, convex=-0.02, resolution=90)
mesh1a <- inla.mesh.2d(boundary = ConvHull, max.edge=c(100))
As an alternative to using the nonconvex hull function to generate a boundary,, we can also take the shapefile of the North Seaas the boundary of the mesh, which makes prediction in the future easier, as we know no record can be found outside the North Sea area (give or take evolution). We first load the ICES shapefiles which were downloaded from http://gis.ices.dk/sf/. We load this shapefiles, merge layers together, transform to UTM, convert UTM to km rather than meters and finally create a mesh.
ICESareas <- readShapePoly(file.path(path,"ICES_Areas_20160601_dense"))
NorthSea <- subset(ICESareas,SubArea==4)
NorthSea <- gUnionCascaded(NorthSea)
proj4string(NorthSea) <- c("+proj=longlat")
NorthSeaUTM <- spTransform(NorthSea,CRS("+proj=utm +zone=31"))
NS.border <- inla.sp2segment(NorthSeaUTM)
NS.border$loc <- NS.border$loc/1000
mesh1b <- inla.mesh.2d(boundary=NS.border, cutoff=3,max.edge=c(30))
The meshes can be plotted using the plot() function on the mesh object. Once the mesh is plotted, the locations of the samples (stored in the Loc object) can be overlayed using points(). Below, the two meshes are plotted side-by-side.
par(mfrow=c(1,2))
plot(mesh1a)
points(Loc, col = 1, pch = 16, cex = 0.3)
plot(mesh1b)
points(Loc, col = 1, pch = 16, cex = 0.3)
Once the 2d mesh is made we construct a observation/prediction weight matrix for the model. This is also called the “projector matrix”.
# 2. Define the weighting factors a_ik (also called the projector matrix).
A1 <- inla.spde.make.A(mesh = mesh1a, loc = Loc)
dim(A1)
## [1] 3514 402
The first dimension of the projector matrix has the size of the number of observations (here 3514), and the second dimension of the projector matrix is the number of nodes in the mesh (here 402).
# 3. Define the spde
spde <- inla.spde2.matern(mesh1a)
w.st <- inla.spde.make.index('w', n.spde = spde$n.spde)
The stack allows INLA to build models with complex linear predictors. Here have a SPDE model combined with covariate fixed effects and an intercept at n hauls.
Before making the stack we need to convert all fixed effects that are factors in the INLA model.
cpue_subset$fYear <- as.factor(cpue_subset$Year)
cpue_subset$fSurvey <- as.factor(cpue_subset$Survey)
Note that in code below the names in the model matrix should not contain any special characters! This is why we renamd the “NS-IBTS” survey" to “NSIBTS”
# 5. Make a stack. In this process we tell INLA at which points on the mesh we sampled the response variable and the covariates.
Xmatrix <- model.matrix(~ fYear + fSurvey + HaulDur, data = cpue_subset)
head(Xmatrix)
## (Intercept) fYear2014 fYear2015 fYear2016 fYear2017 fSurveyNSIBTS
## 1 1 0 1 0 0 0
## 2 1 1 0 0 0 0
## 3 1 1 0 0 0 0
## 4 1 0 1 0 0 0
## 5 1 0 1 0 0 0
## 6 1 0 1 0 0 0
## HaulDur
## 1 30
## 2 30
## 3 30
## 4 30
## 5 30
## 6 30
This Xmatrix contains the model matrix with the fixed effects, including the intercept (The column for the intercept is named “(Intercept)”, and it is 1 for all observations). However, in the next step the intercept is removed from the model matrix. The intercept is then included when making the stack, and named “Intercept” (without brackets).
X <- as.data.frame(Xmatrix[,-1])
names(X) <- c(gsub("[:]",".",names(X)))
head(X)
## fYear2014 fYear2015 fYear2016 fYear2017 fSurveyNSIBTS HaulDur
## 1 0 1 0 0 0 30
## 2 1 0 0 0 0 30
## 3 1 0 0 0 0 30
## 4 0 1 0 0 0 30
## 5 0 1 0 0 0 30
## 6 0 1 0 0 0 30
N <- nrow(cpue_subset)
Stack1 <- inla.stack(
tag = "Fit",
data = list(y = cpue_subset$NoPerHaul),
A = list(1,1, A1),
effects = list(
Intercept=rep(1,N),
X=X, #Covariates
w=w.st)) #Spatial field
The model formula used in the inla model is generated from the names of the model matrix, combined with the intercept term and the spatial correlation model (“f(w, model=spde)”).
Subsequently, two inla models are run, one assuming that the data are Poisson distributed, and another model assuming that the data are negative binomial distributed.
fsp <- parse(text=c("y ~ -1 + Intercept + ",paste(c(names(X)," f(w, model = spde)"),collapse =" + ")))
INLA:::inla.dynload.workaround()
I1p <- inla(eval(fsp), family = "poisson", data=inla.stack.data(Stack1),
control.compute = list(dic = TRUE, waic = TRUE),
control.predictor = list(A = inla.stack.A(Stack1)))
I1nb <- inla(eval(fsp), family = "nbinomial", data=inla.stack.data(Stack1),
control.compute = list(dic = TRUE, waic = TRUE, config=TRUE),
control.predictor = list(A = inla.stack.A(Stack1)))
Once the INLA models are run, a summary can be printed for the two models. This summary contains much of the relevant information for the models.
summary(I1p)
##
## Call:
## c("inla(formula = eval(fsp), family = \"poisson\", data = inla.stack.data(Stack1), ", " control.compute = list(dic = TRUE, waic = TRUE), control.predictor = list(A = inla.stack.A(Stack1)))" )
##
## Time used:
## Pre-processing Running inla Post-processing Total
## 1.0680 13.3102 0.5748 14.9530
##
## Fixed effects:
## mean sd 0.025quant 0.5quant 0.975quant mode kld
## Intercept -1.7055 1.1061 -3.9659 -1.6907 0.4736 -1.6651 0
## fYear2014 0.1091 0.0472 0.0163 0.1092 0.2016 0.1093 0
## fYear2015 0.1115 0.0448 0.0235 0.1115 0.1995 0.1116 0
## fYear2016 0.0040 0.0452 -0.0849 0.0040 0.0927 0.0041 0
## fYear2017 -0.0213 0.0822 -0.1849 -0.0206 0.1380 -0.0192 0
## fSurveyNSIBTS -1.6179 0.0396 -1.6958 -1.6178 -1.5404 -1.6177 0
## HaulDur 0.0264 0.0047 0.0173 0.0263 0.0356 0.0263 0
##
## Random effects:
## Name Model
## w SPDE2 model
##
## Model hyperparameters:
## mean sd 0.025quant 0.5quant 0.975quant mode
## Theta1 for w 2.255 0.1309 2.001 2.254 2.517 2.248
## Theta2 for w -4.364 0.1822 -4.732 -4.360 -4.014 -4.346
##
## Expected number of effective parameters(std dev): 116.14(3.742)
## Number of equivalent replicates : 30.26
##
## Deviance Information Criterion (DIC) ...: 8301.77
## Effective number of parameters .........: 114.74
##
## Watanabe-Akaike information criterion (WAIC) ...: 8619.12
## Effective number of parameters .................: 387.72
##
## Marginal log-Likelihood: -4321.40
## Posterior marginals for linear predictor and fitted values computed
summary(I1nb)
##
## Call:
## c("inla(formula = eval(fsp), family = \"nbinomial\", data = inla.stack.data(Stack1), ", " control.compute = list(dic = TRUE, waic = TRUE, config = TRUE), ", " control.predictor = list(A = inla.stack.A(Stack1)))")
##
## Time used:
## Pre-processing Running inla Post-processing Total
## 0.8344 12.9337 0.7413 14.5093
##
## Fixed effects:
## mean sd 0.025quant 0.5quant 0.975quant mode kld
## Intercept -2.4482 1.7544 -6.1411 -2.4389 1.2136 -2.4184 0
## fYear2014 0.2498 0.1074 0.0395 0.2496 0.4612 0.2492 0
## fYear2015 0.3896 0.1072 0.1797 0.3893 0.6004 0.3889 0
## fYear2016 0.3685 0.1067 0.1595 0.3683 0.5785 0.3678 0
## fYear2017 0.2941 0.1486 0.0031 0.2939 0.5865 0.2933 0
## fSurveyNSIBTS -1.2413 0.0897 -1.4176 -1.2412 -1.0653 -1.2411 0
## HaulDur 0.0436 0.0080 0.0279 0.0436 0.0593 0.0436 0
##
## Random effects:
## Name Model
## w SPDE2 model
##
## Model hyperparameters:
## mean sd
## size for the nbinomial observations (1/overdispersion) 0.6986 0.0451
## Theta1 for w 2.7702 0.1402
## Theta2 for w -4.8887 0.2533
## 0.025quant 0.5quant
## size for the nbinomial observations (1/overdispersion) 0.6138 0.6973
## Theta1 for w 2.5059 2.7653
## Theta2 for w -5.4193 -4.8745
## 0.975quant mode
## size for the nbinomial observations (1/overdispersion) 0.7912 0.6948
## Theta1 for w 3.0575 2.7476
## Theta2 for w -4.4274 -4.8233
##
## Expected number of effective parameters(std dev): 88.52(5.013)
## Number of equivalent replicates : 39.70
##
## Deviance Information Criterion (DIC) ...: 6358.76
## Effective number of parameters .........: 88.19
##
## Watanabe-Akaike information criterion (WAIC) ...: 6365.30
## Effective number of parameters .................: 86.17
##
## Marginal log-Likelihood: -3292.18
## Posterior marginals for linear predictor and fitted values computed
idx <- inla.stack.index(Stack1, tag= 'Fit')$data
quantile(I1nb$summary.fitted.values[idx,"mean"])
## 0% 25% 50% 75% 100%
## 9.015974e-04 3.090979e-02 2.637386e-01 1.133827e+00 3.271924e+01
quantile(cpue_subset$NoPerHaul)
## 0% 25% 50% 75% 100%
## 0 0 0 1 100
hist(I1nb$summary.fitted.values[idx,"mean"],200, xlim=c(0,50))
hist(cpue_subset$NoPerHaul,200, xlim=c(0,50))
wproj <- inla.mesh.projector(mesh1a, xlim = range(Loc[,1]), ylim = range(Loc[,2]))
wm.pm100100 <- inla.mesh.project(wproj, I1nb$summary.random$w$mean)
wsd.pm100100 <- inla.mesh.project(wproj, I1nb$summary.random$w$sd)
grid <- expand.grid(x = wproj$x, y = wproj$y)
grid$zm <- as.vector(wm.pm100100)
grid$zsd <- as.vector(wsd.pm100100)
wld <- map('world', xlim=c(-5,15), ylim=c(47,62),plot=FALSE)
wld <- data.frame(lon=wld$x, lat=wld$y)
UTMmap <- project(cbind(wld$lon, wld$lat), "+proj=utm +zone=31U ellps=WGS84")
UTMmapFinal <- data.frame("xm" =UTMmap[,1]/1e3, "ym" = UTMmap[,2]/1e3)
p1 <- levelplot(zm ~ x * y ,
data = grid,
scales = list(draw = TRUE),
xlab = list("Easting", cex = 1),
ylab = list("Northing", cex = 1),
main = list("Posterior mean spatial random fields", cex = 1),
col.regions=tim.colors(25, alpha = 1),
panel=function(x, y, z, subscripts,...){
panel.levelplot(x, y, z, subscripts,...)
grid.points(x = cpue_subset$Xkm,
y = cpue_subset$Ykm,
pch = 1,
size = unit(cpue_subset$NoPerHaul/15+0.1, "char"))
}) + xyplot(ym~ xm, UTMmapFinal, type='l', lty=1, lwd=0.5, col='black')
p2 <- levelplot(zsd ~ x * y,
data = grid,
scales = list(draw = TRUE),
xlab = list("Easting", cex = 1),
ylab = list("Northing", cex = 1),
main = list("Posterior sd spatial random fields", cex = 1),
col.regions=tim.colors(25, alpha = 1),
panel=function(x, y, z, subscripts,...){
panel.levelplot(x, y, z, subscripts,...)
grid.points(x = cpue_subset$Xkm,
y = cpue_subset$Ykm,
pch = 1,
size = unit(cpue_subset$NoPerHaul/15+0.1, "char"))
}) + xyplot(ym~ xm, UTMmapFinal, type='l', lty=1, lwd=0.5, col='black')
grid.arrange(p1,p2, ncol=2)
tcoo <- rbind(c(0.3, 0.3), c(0.5, 0.5), c(0.7, 0.7))
inla.nonconvex.hull
dim(Ap <- inla.spde.make.A(mesh = mesh1, loc = tcoo))
## [1] 3 505
x0 <- c(0.5, 0.5, 0.5)
#To do a fully Bayesian analysis, we include the target locations on the estimation process by
#assigning NA for the response at these locations. Defining the prediction stack
stk.pred <- inla.stack(tag='pred', A=list(Ap, 1), data=list(y=NA), ## response as NA
effects=list(s=1:spde$n.spde, data.frame(x=x0, b0=1)))
#Fit the model again with the full stack
stk.full <- inla.stack(stk.e, stk.pred)
p.res <- inla(formula, data=inla.stack.data(stk.full), ## full stack
control.predictor=list(compute=TRUE, ## compute the predictor
A=inla.stack.A(stk.full))) ## using full stack data
#Get the prediction data index and collect the PMD to work with
pred.ind <- inla.stack.index(stk.full, tag = "pred")$data
ypost <- p.res$marginals.fitted.values[pred.ind]
#Visualize PMD for the linear predictor at target locations with commands bellow
names(ypost) <- paste('y', seq_along(ypost), sep='_'); library(plyr)
xyplot(y~x | .id, ldply(ypost), panel='llines', xlab='y', ylab='Density')
#In INLA we have some functions to work with marginals distributions
apropos("marginal")
## [1] "inla.dmarginal" "inla.emarginal" "inla.hpdmarginal"
## [4] "inla.mmarginal" "inla.pmarginal" "inla.qmarginal"
## [7] "inla.rmarginal" "inla.smarginal" "inla.tmarginal"
inla.mmarginal(ypost[[1]]) ## mode
## [1] 1.697
inla.qmarginal(c(0.15, 0.7), ## quantiles
ypost[[1]])
## [1] 1.345 1.875
inla.pmarginal(inla.qmarginal(
0.3, ypost[[1]]), ypost[[1]])
## [1] 0.3
Owing to ontogenetic niche shifts we expect that the spatial distribution of small indivduals is different from the spatial distribution of large indivuals. Hence we want to include a length component in the spatial distribution of the counts.
First we need to make a new subset of the data that includes the length information (and leaving out the aggregation step in our earlier analysis). We use the same starting year as was done for the previous analysis: 2013.
# make selection of fullset.
cpue_subset <- full_cpue[full_cpue$Year >= startyr,]
Let's make 5 cm classes instead of 1 cm classes. This reduces the number of observations. We use round to achieve this. Remember that the units of the length measurements is mm.
Once we have 5 cm lenght classes, we need aggregate() to sum the numbers per haul within our new length bins. Becaus we need the other info in the data set as well we include all variables in the aggregate.
cpue_subset$LngtClass <- round(cpue_subset$LngtClass/50)*50
cpue_subset <- aggregate(NoPerHaul~ LngtClass + uniqueHaul + Survey + Quarter + Ship + Gear + HaulNo + Year + HaulDur + ShootLong +
ShootLat + Xkm + Ykm + Depth, data= cpue_subset,FUN="sum")
Below, we inspect the length range of observations. There are no individuals with lengths over 600 mm observed. Hence, we remove those lengthclasses. If there are too many subsequent lenght classes with only zeros, the model will fail, giving a warning that one of the eigenvalues is negative.
aggregate(NoPerHaul~LngtClass, data=cpue_subset, FUN= "sum")
## LngtClass NoPerHaul
## 1 0 0
## 2 50 6
## 3 100 335
## 4 150 265
## 5 200 304
## 6 250 410
## 7 300 579
## 8 350 651
## 9 400 903
## 10 450 560
## 11 500 104
## 12 550 10
## 13 600 1
## 14 650 0
## 15 700 0
## 16 750 0
## 17 800 0
## 18 850 0
cpue_subset <- cpue_subset[cpue_subset$LngtClass< 650,]
Because we now have combinations for all hauls and several length classes, the subset of data (since 2009) is still very big. It has45682 observations. In order to reduce the number of observaions used in the INLA model, we take all non-zero observations but we subsample the zero observations. The remaining zero observations willneed to receive a higher weight in the final model. Note that this is not yet implemented.
#cpue_subset_temp <- cpue_subset[cpue_subset$NoPerHaul > 0,]
#cpue_subset_temp <-rbind(cpue_subset_temp, cpue_subset[sample(which(cpue_subset$NoPerHaul==0),40000),])
#cpue_subset <- cpue_subset_temp
As before, The UTM coordinates of the observations are combined into a Loc (location) dataset. That dataset is used to create a mesh in the next step.
Loc <- cbind(cpue_subset$Xkm , cpue_subset$Ykm )
Next we need a mesh for the spatial data. Because we do not want our spatial correlations to pass landmasses (e.g. Denmark) we first make a convex hull of the data points using inla.nonconvex.hull(). This convex hull is used as a boundary for making a 2d mesh.
ConvHull <- inla.nonconvex.hull(Loc, convex=-0.02, resolution=90)
mesh2a <- inla.mesh.2d(boundary = ConvHull, max.edge=c(100))
plot(mesh2a)
points(Loc, col = 1, pch = 16, cex = 0.5)
Next we inspect the length range in the data by making a histogram of length observations. This histogram can be used to define the locations ofa number of “knots” along the length range that we will later use for our analysis. More knots means a longer computing time (but a higher degree of flexibility in the length component of the spatial correlation).
There are no individuals with lengths over 680 mm observed. Hence, we remove those lengthclasses (with only zeros) from the data set and define the knots to be within the new length range
hist(cpue_subset$LngtClass,200)
knots <- seq(from = 50, to = 550, by = 100)
knots
## [1] 50 150 250 350 450 550
Using the 6 knots as locations, we make a 1 dimensional mesh.
# One-dimensional mesh for length class. See the time series module
mesh.t <- inla.mesh.1d(loc = knots)
In this mesh.t object, the dimensions can be checked using mesh.t$loc (for the locations) and mesh.t$n (the number of locations). The code below confirms that there are 5 locations, and those are at the values of the knots object.
mesh.t$n
## [1] 6
mesh.t$loc
## [1] 50 150 250 350 450 550
Now that there is a 1 dimensional mesh for the lengths, we use it to construct a observation/prediction weight matrix (“projector matrix”) based on the spatial mesh that we already created earlier (mesh1) and our new mesh for the lengths. The lengths are used in the “group model”. The new projector matrix is names A2 to distinguish it from the projector matrix of the previous model.
# 2. Define the weighting factors a_ik (projector matrix).
NGroups <- length(knots)
A2 <- inla.spde.make.A(mesh = mesh2a,
loc = Loc,
group = cpue_subset$LngtClass,
group.mesh = mesh.t)
dim(A2)
## [1] 45682 2412
# 3. Define the spde
spde <- inla.spde2.matern(mesh2a)
We need to make an inla.spde model object for a Matern model, but we still have that available from the model without size structure. That object was named “spde”. We use it to make a list of named index vectors for the SPDE model. Note that the command for making the list of index vectors now includes an argument for the groups.
w.st <- inla.spde.make.index('w',
n.spde = spde$n.spde,
n.group = NGroups)
Before making the stack we need to convert all fixed effects that are factors in the INLA model.
cpue_subset$fYear <- as.factor(cpue_subset$Year)
cpue_subset$fSurvey <- as.factor(cpue_subset$Survey)
Next we make a new stack. For this we need a model matrix. Although the fixed effects are the same as in the previous model, we still need to make a new model matrix because the data now include the length structure.
# 5. Make a stack.
Xmatrix <- model.matrix(~ fYear + fSurvey + HaulDur, data = cpue_subset)
head(Xmatrix)
## (Intercept) fYear2014 fYear2015 fYear2016 fYear2017 fSurveyNSIBTS
## 1 1 0 1 0 0 0
## 2 1 0 1 0 0 0
## 3 1 0 1 0 0 0
## 4 1 0 1 0 0 0
## 5 1 0 1 0 0 0
## 6 1 0 1 0 0 0
## HaulDur
## 1 30
## 2 30
## 3 30
## 4 30
## 5 30
## 6 30
This Xmatrix contains the model matrix with the fixed effects, including the intercept (The column for the intercept is named “(Intercept)”, and it is 1 for all observations). However, in the next step the intercept is removed from the model matrix. The intercept is then included when making the stack, and named “Intercept” (without brackets).
X <- as.data.frame(Xmatrix[,-1])
names(X) <- c(gsub("[:]",".",names(X)))
head(X)
## fYear2014 fYear2015 fYear2016 fYear2017 fSurveyNSIBTS HaulDur
## 1 0 1 0 0 0 30
## 2 0 1 0 0 0 30
## 3 0 1 0 0 0 30
## 4 0 1 0 0 0 30
## 5 0 1 0 0 0 30
## 6 0 1 0 0 0 30
N <- nrow(cpue_subset)
Stack2 <- inla.stack(
tag = "Fit",
data = list(y = cpue_subset$NoPerHaul),
A = list(1,1, A2),
effects = list(
Intercept = rep(1, N),
X = as.data.frame(X), # Covariates
w.st)) # Spatial-temp field
fsp <- parse(text=c("y ~ -1 + Intercept + ",
paste(c(names(X)," f(w, model = spde, group = w.group, control.group = list(model = 'ar1'))"),collapse =" + ")))
#INLA:::inla.dynload.workaround()
I2nb <- inla(eval(fsp), family = "nbinomial",
data=inla.stack.data(Stack2),
control.compute = list(dic = TRUE, waic = TRUE),
control.predictor = list(A = inla.stack.A(Stack2)))
summary(I2nb)
##
## Call:
## c("inla(formula = eval(fsp), family = \"nbinomial\", data = inla.stack.data(Stack2), ", " control.compute = list(dic = TRUE, waic = TRUE), control.predictor = list(A = inla.stack.A(Stack2)))" )
##
## Time used:
## Pre-processing Running inla Post-processing Total
## 1.0349 1516.9950 4.7789 1522.8089
##
## Fixed effects:
## mean sd 0.025quant 0.5quant 0.975quant mode
## Intercept -2.5992 18.1630 -33.7041 -6.6225 33.0557 -6.5965
## fYear2014 0.3656 0.4334 -0.1090 0.1788 1.3145 0.1428
## fYear2015 0.2419 0.1486 0.0394 0.2157 0.7000 0.2019
## fYear2016 0.2521 0.1490 0.0572 0.2245 0.7218 0.2118
## fYear2017 0.0481 0.3974 -1.4679 0.0906 0.7015 0.0918
## fSurveyNSIBTS -1.4360 0.3366 -1.9532 -1.5022 -0.7166 -1.4888
## HaulDur 0.0377 0.0102 0.0218 0.0359 0.0624 0.0339
## kld
## Intercept 0.0018
## fYear2014 0.0000
## fYear2015 0.0000
## fYear2016 0.0000
## fYear2017 0.0000
## fSurveyNSIBTS 0.0000
## HaulDur 0.0000
##
## Random effects:
## Name Model
## w SPDE2 model
##
## Model hyperparameters:
## mean sd
## size for the nbinomial observations (1/overdispersion) 0.4047 0e+00
## Theta1 for w 2.4403 1e-04
## Theta2 for w -4.6484 3e-03
## GroupRho for w 0.7709 5e-04
## 0.025quant 0.5quant
## size for the nbinomial observations (1/overdispersion) 0.4045 0.4046
## Theta1 for w 2.4398 2.4403
## Theta2 for w -4.6541 -4.6485
## GroupRho for w 0.7698 0.7710
## 0.975quant mode
## size for the nbinomial observations (1/overdispersion) 0.4216 0.4046
## Theta1 for w 2.4840 2.4402
## Theta2 for w -4.6422 -4.6490
## GroupRho for w 0.7715 0.7714
##
## Expected number of effective parameters(std dev): 256.53(1.199)
## Number of equivalent replicates : 178.08
##
## Watanabe-Akaike information criterion (WAIC) ...: 14418438.38
## Effective number of parameters .................: 7199975.03
##
## Marginal log-Likelihood: -7998.51
## Posterior marginals for linear predictor and fitted values computed
We still have The “UTMmap” object for creating the maps from the previous analysis.
w <- I2nb$summary.random$w$mean
# length of w is mesh$n * NGroups
wproj <- inla.mesh.projector(mesh2a, xlim = range(Loc[,1]), ylim = range(Loc[,2]))
grid <- expand.grid(length=knots, x = wproj$x, y = wproj$y,zm=NA)
for (i in knots){
w.pm100100 <- inla.mesh.project(wproj,w[w.st$w.group==which(i==knots)])
grid[grid$length==i,]$zm <- as.vector(w.pm100100) }
Next we print the grid, which is now estimated at each knot. The observed counts for the length class at each knot are included in each panel.
print(levelplot(zm ~ x * y |length,
data = grid,
scales = list(draw = TRUE),
xlab = list("Easting", cex = 1),
ylab = list("Northing", cex = 1),
main = list("Posterior mean spatial random fields", cex = 1),
col.regions=tim.colors(25, alpha = 1),
panel=function(x, y, z, subscripts,...){
panel.levelplot(x, y, z, subscripts,...)
grid.points(x = cpue_subset[cpue_subset$LngtClass == grid[subscripts[1],]$length,]$Xkm,
y = cpue_subset[cpue_subset$LngtClass == grid[subscripts[1],]$length,]$Ykm,
pch = 1,
size = unit(cpue_subset[cpue_subset$LngtClass ==grid[subscripts[1],]$length,]$NoPerHaul/15, "char"))
}) + xyplot(ym~ xm, UTMmapFinal, type='l', lty=1, lwd=0.5, col='black'))
# 5. Make a stack.
Xmatrix <- model.matrix(~ fYear + fSurvey + HaulDur, data = cpue_subset)
head(Xmatrix)
## (Intercept) fYear2014 fYear2015 fYear2016 fYear2017 fSurveyNSIBTS
## 1 1 0 1 0 0 0
## 2 1 0 1 0 0 0
## 3 1 0 1 0 0 0
## 4 1 0 1 0 0 0
## 5 1 0 1 0 0 0
## 6 1 0 1 0 0 0
## HaulDur
## 1 30
## 2 30
## 3 30
## 4 30
## 5 30
## 6 30
X <- as.data.frame(Xmatrix[,-1])
names(X) <- c(gsub("[:]",".",names(X)))
head(X)
## fYear2014 fYear2015 fYear2016 fYear2017 fSurveyNSIBTS HaulDur
## 1 0 1 0 0 0 30
## 2 0 1 0 0 0 30
## 3 0 1 0 0 0 30
## 4 0 1 0 0 0 30
## 5 0 1 0 0 0 30
## 6 0 1 0 0 0 30
N <- nrow(cpue_subset)
Stack3 <- inla.stack(
tag = "Fit",
data = list(y = cpue_subset$NoPerHaul),
A = list(1,1, A2),
effects = list(
Intercept = rep(1, N),
X = as.data.frame(X), # Covariates
w.st)) # Spatial-temp field
fsp <- parse(text=c("y ~ -1 + Intercept + ",
paste(c(names(X)," f(w, model = spde, group = w.group, control.group = list(model = 'ar1'))", "f(LngtClass, group=fSurvey,'rw2')",collapse =" + ")))
#INLA:::inla.dynload.workaround()
I3nb <- inla(eval(fsp), family = "nbinomial",
data=inla.stack.data(Stack3),
control.compute = list(dic = TRUE, waic = TRUE),
control.predictor = list(A = inla.stack.A(Stack3)))
summary(I3nb)
Reread IBTS data
cpue_IBTS <- read.csv(file.path(path,"CPUE per length per haul_2017-07-12 11_18_36.csv"), stringsAsFactors = F)
cpue_IBTS$NoPerHaul <- round(cpue_IBTS$CPUE_number_per_hour*(cpue_IBTS$HaulDur/60))
cpue_IBTS <- cpue_IBTS[!(names(cpue_IBTS) %in% c("DateofCalculation", "Area", "SubArea", "DayNight","Species","Sex","DateTime","CPUE_number_per_hour"))]
cpue_IBTS$uniqueHaul <- paste(cpue_IBTS$Year, cpue_IBTS$Survey, cpue_IBTS$Quarter, cpue_IBTS$Ship, cpue_IBTS$HaulNo, cpue_IBTS$ShootLat, cpue_IBTS$ShootLon, sep="_")
File contains HH records in exchange files for IBTS. NA for door spread and distance are coded as -9, so make NAs of those.
HH_IBTS <- read.csv(file.path(path,"Exchange Data_2017-07-14 09_26_25.csv"), stringsAsFactors = F)
HH_IBTS$uniqueHaul <- paste(HH_IBTS$Year, HH_IBTS$Survey, HH_IBTS$Quarter, HH_IBTS$Ship, HH_IBTS$HaulNo, HH_IBTS$ShootLat, HH_IBTS$ShootLon, sep="_")
HH_IBTS[HH_IBTS$DoorSpread == -9,]$DoorSpread <- NA
HH_IBTS[HH_IBTS$Distance == -9,]$Distance <- NA
HH_IBTS[HH_IBTS$Netopening == -9,]$Netopening <- NA
HH_IBTS$surface <- HH_IBTS$Distance * HH_IBTS$Netopening
Select only relevant variables, and remove all hauls longer than 90 minutes
HH_IBTS <- HH_IBTS[names(HH_IBTS) %in% c("uniqueHaul", "surface", "HaulDur","Doorspread","Ditance","Netopening","Warplngt")]
HH_IBTS <- HH_IBTS[HH_IBTS$HaulDur<=90,]
Plot surface against duration hoping to find corr
plot(x=HH_IBTS$HaulDur,y=HH_IBTS$surface)
summary(lm(surface~HaulDur, data=HH_IBTS))
##
## Call:
## lm(formula = surface ~ HaulDur, data = HH_IBTS)
##
## Residuals:
## Min 1Q Median 3Q Max
## -34931 -2186 -134 1979 34770
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 229.092 139.597 1.641 0.101
## HaulDur 587.464 4.482 131.067 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3531 on 14777 degrees of freedom
## (12906 observations deleted due to missingness)
## Multiple R-squared: 0.5376, Adjusted R-squared: 0.5375
## F-statistic: 1.718e+04 on 1 and 14777 DF, p-value: < 2.2e-16
Merge haulinfo to full cpue set
cpue_IBTS <- merge(cpue_IBTS, HH_IBTS, all.x=T, all.y=F)
cpue_BTS <- read.csv(file.path(path,"CPUE per length per Hour and Swept Area_2017-07-12 11_29_36.csv"), stringsAsFactors = F)
cpue_BTS <- cpue_BTS[!(names(cpue_BTS) %in% c("Country","DoorType","HaulLat","HaulLong","StatRec","DataType","Rigging","Tickler","Warplngt","TowDir","WindDir","WindSpeed","SwellDir","SwellHeight","StNo","Month","Day","TimeShot","Stratum","ICESArea","DateofCalculation", "Area", "SubArea", "DayNight","Species", "Sex","SpecVal","SubFactor","GearExp","SweepLngt","Netopening","BycSpecRecCode","StdSpecRecCode","HaulVal","CPUE_number_per_hour","CPUE_number_per_km2","SweptArea_km2","DistanceDerived", "HLNoAtLngt"))]
surface
## function (object, ...)
## {
## UseMethod("surface")
## }
## <environment: namespace:fields>
cpue_BTS[is.na(cpue_BTS$LngtClass),]$NoPerHaul <- 0
cpue_BTS[is.na(cpue_BTS$LngtClass),]$LngtClass <- 0
cpue_BTS$surface <- cpue_BTS$Distance * cpue_BTS$BeamWidth
cpue_BTS$uniqueHaul <- paste(cpue_BTS$Year, cpue_BTS$Survey, cpue_BTS$Quarter, cpue_BTS$Ship, cpue_BTS$HaulNo, cpue_BTS$ShootLat, cpue_BTS$ShootLon, sep="_")
Next we combine the two sets into one.
cols <- intersect(colnames(cpue_BTS), colnames(cpue_IBTS))
cpue <- rbind(cpue_BTS[,cols],cpue_IBTS[,cols])
Now that the IBTS and BTS data sets are combined we want to make a set where we have counts for all hauls and all lenghts. This means first making an dentifier for each unique haul (based on the information we have for all the hauls). This identifier is used to make a “trawllist” where all the information for the hauls is stored.
Once the trawllist is make we use epand.grid() to make a combination of all hauls and lenght classes. This set is merged with our original data set.
trawllist <- cpue[!duplicated(cpue$uniqueHaul),!names( cpue) %in% c("Species","AphiaID","NoPerHaul","Sex", "LngtClass")]
#expand grid
haulsLengths <- expand.grid(uniqueHaul=unique(cpue$uniqueHaul),LngtClass=unique(cpue$LngtClass), stringsAsFactors = F)
full_cpue <- merge(haulsLengths,cpue[names(cpue) %in% c("uniqueHaul", "LngtClass","NoPerHaul")], all.x=T)
rm(haulsLengths)
head(full_cpue)
## uniqueHaul LngtClass NoPerHaul
## 1 1966_NS-IBTS_1_AND_1_54.2333_7.45 0 0
## 2 1966_NS-IBTS_1_AND_1_54.2333_7.45 10 NA
## 3 1966_NS-IBTS_1_AND_1_54.2333_7.45 20 NA
## 4 1966_NS-IBTS_1_AND_1_54.2333_7.45 30 NA
## 5 1966_NS-IBTS_1_AND_1_54.2333_7.45 50 NA
## 6 1966_NS-IBTS_1_AND_1_54.2333_7.45 60 NA
After we merged all possible combinations with the data we now have NAs for those lengts and hauls where the catch is zero, and so we set those to zero. This data is subsequently merged to the trawllist so that we now have all information together.
full_cpue[is.na(full_cpue$NoPerHaul),]$NoPerHaul <- 0
full_cpue <- merge(full_cpue,trawllist, all.x=T)
The records that have lenghts equal to zero (that indicated zero hauls in our original set ) now need to be removed because we have all the information we need (these hauls now have zero catch for the full length range observed in the survey).
In addition, there are some observations that are highly unlikely: For instance there is a single observation of an individual of 100 cm (in 1977). This is suspicious, and we remove it from the data.
#now remove zero lengths
full_cpue <- full_cpue[full_cpue$LngtClass> 0,]
full_cpue <- full_cpue[full_cpue$LngtClass< 990,]
For our spatial correlation we will need an isomorphic coordinate system. Therefore we transform the latitudes and longitudes to UTM coordinates. these UTM coordinates are given in meters, so we divide them by 1000 to get kilometers.
UTM <- project(cbind(full_cpue$ShootLong, full_cpue$ShootLat), "+proj=utm +zone=31U ellps=WGS84")
full_cpue$Xkm <- UTM[,1]/1000
full_cpue$Ykm <- UTM[,2]/1000
The INLA code does not like special characters in some of the variable names, like the hyphen in “NS-IBTS”. Therefore we rename the survey to NSIBTS.
#recode survey names so that there are no special characters
full_cpue[full_cpue$Survey=="NS-IBTS",]$Survey <- "NSIBTS"
# make selection of fullset.
cpue_subset <- full_cpue[full_cpue$Year >= startyr,]
Do actual INLA for counts per km2 :)
Next we want to simulate 1000 realizations and integrate over surface (per year) so that we get a population level estimate.
# Simulate from model I1nb
# Simulation study spatial Negbin model.
# First set random seed and the number of simulations
set.seed(12345)
NSim <- 1000
# Carry out the simulation
Sim <- inla.posterior.sample(n = NSim, result = I1nb)
# Now we have 1000 simulated betas and also 1000 spatial fields w
# Use these to calculate 1000 times the expected values mu,
# and simulate count data from these using rpois
# X matrix
#Xm <- model.matrix(~BreedingD.std + Rain.std, data = cpue_subset)
Xm <- model.matrix(~ fYear + fSurvey + HaulDur , data = cpue_subset)
# Create space to store the information
N <- nrow(cpue_subset)
Ysim <- matrix(nrow = N, ncol = NSim)
mu.i <- matrix(nrow = N, ncol = NSim)
Xm <- as.matrix(Xm)
Am <- as.matrix(A3)
# We are calculating a Fixed part (X * beta) and also the random part (A * w)
for (i in 1: NSim){
Betas <- Sim[[i]]$latent[c(987, 988, 989)]
wk <- Sim[[i]]$latent[416:986]
eta <- Xm %*% Betas + Am %*% wk
mu.i[,i] <- exp(eta)
Ysim[,i] <- rpois(n = nrow(OCW), lambda = mu.i[,i])
}
# Now we have 1,000 simulated count data sets from the model
# And plot the results of the simulation study
Z <- matrix(nrow = max(Ysim)+1, ncol = NSim)
for (i in 1: NSim){
zi <- table(Ysim[,i])
I <- as.numeric(names(zi)) + 1
Z[I,i] <- zi
}